Real-world Data Wrangling¶

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess it programmatically and visually across elements of data quality and structure, and implement a cleaning strategy for the data. You will then store the updated data in your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

  1. Explain your decisions about the methods used for gathering, assessing, cleaning, storing, and answering the research question
  2. Write code comments so your code is more readable

1. Gather data¶

In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.

1.1. Problem Statement¶

In this project, the objective is to examine crash and collision data from two large cities in the United States: Chicago and New York. The aim is to determine:

  • Which city records more fatalities when crashes occur, Chicago or New York City?

  • Between Chicago and New York, which date or time of year has recorded the highest number of fatalities across both cities?

The data that will be used and the source¶

The two datasets that will be used for this project can be found on the catalog.data.gov website.

  1. The Chicago dataset, which can be found here: Chicago-motor-vehicle-crashes

  2. The New York dataset, which can be found here: NewYork-motor-vehicle-collisions

1.2. Different data gathering methods¶

List of data gathering methods:

  • Download data manually
  • Programmatically downloading files
  • Gather data by accessing APIs
  • Gather and extract data from HTML files using BeautifulSoup
  • Extract data from a SQL database

Data Gathering Methods used for this Project.¶

For this project, the two data gathering methods chosen from the list above are:

  1. Download data manually

  2. Gather data by accessing APIs

Setting Things Up¶

Here, the libraries that will be used are imported.

In [1]:
# import libraries
import os
import numpy as np
import pandas as pd
import requests
import bs4
import sklearn
import PIL
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import missingno as msn

%matplotlib inline

Dataset 1¶

The first dataset that will be used is Traffic Crashes in Chicago.

The file type is CSV. The file name is traffic-crashes-chicago.csv

Method of loading the data¶

The method used here is to open the CSV file manually. The file has already been downloaded locally into the datasets folder. It will be opened using the pandas read_csv method.

In [2]:
# Path to the locally downloaded CSV file
sub = os.path.join('datasets', 'traffic-crashes-chicago.csv')

# Load the dataset with pandas
chicago_rawdata = pd.read_csv(sub)

# Preview the first five rows
chicago_rawdata.head()
Out[2]:
CRASH_RECORD_ID RD_NO CRASH_DATE_EST_I CRASH_DATE POSTED_SPEED_LIMIT TRAFFIC_CONTROL_DEVICE DEVICE_CONDITION WEATHER_CONDITION LIGHTING_CONDITION FIRST_CRASH_TYPE ... INJURIES_NON_INCAPACITATING INJURIES_REPORTED_NOT_EVIDENT INJURIES_NO_INDICATION INJURIES_UNKNOWN CRASH_HOUR CRASH_DAY_OF_WEEK CRASH_MONTH LATITUDE LONGITUDE LOCATION
0 79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4... JC199149 NaN 03/25/2019 02:43:00 PM 30 TRAFFIC SIGNAL FUNCTIONING PROPERLY CLEAR DAYLIGHT TURNING ... 0.0 1.0 2.0 0.0 14 2 3 41.884547 -87.641201 POINT (-87.64120093714 41.884547224337)
1 792b539deaaad65ee5b4a9691d927a34d298eb33d42af0... JB422857 NaN 09/05/2018 08:40:00 AM 30 NO CONTROLS NO CONTROLS CLEAR DAYLIGHT ANGLE ... 0.0 0.0 2.0 0.0 8 4 9 41.968562 -87.740659 POINT (-87.740659314632 41.968562453871)
2 0115ade9a755e835255508463f7e9c4a9a0b47e9304238... JF318029 NaN 07/15/2022 12:45:00 AM 30 UNKNOWN UNKNOWN CLEAR DARKNESS, LIGHTED ROAD ANGLE ... 0.0 0.0 2.0 0.0 0 6 7 41.886336 -87.716203 POINT (-87.716203130599 41.886336409761)
3 017040c61958d2fa977c956b2bd2d6759ef7754496dc96... JF324552 NaN 07/15/2022 06:50:00 PM 30 TRAFFIC SIGNAL FUNCTIONING PROPERLY CLEAR DAYLIGHT REAR END ... 0.0 0.0 2.0 0.0 18 6 7 41.925111 -87.667997 POINT (-87.667997321599 41.925110815832)
4 78eee027ec3dcc85d36c9e3fdae4729dcc56440105d65b... JB291672 NaN 06/03/2018 05:00:00 PM 30 NO CONTROLS NO CONTROLS CLEAR UNKNOWN PARKED MOTOR VEHICLE ... 0.0 0.0 1.0 0.0 17 1 6 41.910758 -87.731389 POINT (-87.731388754145 41.910757551599)

5 rows × 49 columns

From the above, it can be seen that the traffic-crashes-chicago.csv dataset has been successfully loaded using pandas and assigned to the variable chicago_rawdata.

Small Summary about Dataset 1¶
  • We picked the Chicago dataset because it is vast and appears well kept up to date. As one of the largest cities in the United States, Chicago's accident count must be quite high, so its accident data is vital.

  • The gathering method used, as noted above, was to download the dataset and load the CSV file locally from the PC.

  • Some interesting variables picked up at a glance were CRASH_DATE, LOCATION, and CRASH_MONTH

Dataset 2¶

The next dataset used will be Motor Vehicle Collisions - Crashes in New York.

The data is a JSON file served from a webpage.

Method of loading the data¶

Loading the data here will be slightly different from the method used for Dataset 1. Here we will request the data from the API documented on the webpage Motor-Vehicle-Collisions-NYC, which provides the data in JSON format. The data will then be converted to a pandas DataFrame.

In [3]:
# The URL where the data is located
URL_link = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json'

# Read the JSON response directly into a DataFrame.
# Note: without a limit parameter, the endpoint returns only
# its default first page of rows (1,000 here).
newyork_rawdata_json = pd.read_json(URL_link)

# Preview the first five rows
newyork_rawdata_json.head()
Out[3]:
crash_date crash_time on_street_name off_street_name number_of_persons_injured number_of_persons_killed number_of_pedestrians_injured number_of_pedestrians_killed number_of_cyclist_injured number_of_cyclist_killed ... latitude longitude location cross_street_name contributing_factor_vehicle_3 vehicle_type_code_3 contributing_factor_vehicle_4 vehicle_type_code_4 contributing_factor_vehicle_5 vehicle_type_code_5
0 2021-09-11T00:00:00.000 2023-05-21 02:39:00 WHITESTONE EXPRESSWAY 20 AVENUE 2 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2022-03-26T00:00:00.000 2023-05-21 11:45:00 QUEENSBORO BRIDGE UPPER NaN 1 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2022-06-29T00:00:00.000 2023-05-21 06:55:00 THROGS NECK BRIDGE NaN 0 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2021-09-11T00:00:00.000 2023-05-21 09:35:00 NaN NaN 0 0 0 0 0 0 ... 40.667202 -73.866500 {'latitude': '40.667202', 'longitude': '-73.86... 1211 LORING AVENUE NaN NaN NaN NaN NaN NaN
4 2021-12-14T00:00:00.000 2023-05-21 08:13:00 SARATOGA AVENUE DECATUR STREET 0 0 0 0 0 0 ... 40.683304 -73.917274 {'latitude': '40.683304', 'longitude': '-73.91... NaN NaN NaN NaN NaN NaN NaN

5 rows × 29 columns

Small summary of Dataset 2¶
  • When starting a project about accidents, the first city I searched for was New York. New York is a busy city with heavy traffic, a large population, and constant movement, so it had to be included in the findings.

  • The gathering method here, as mentioned, was to download the data, which is an online JSON file served through an API on the website. The data downloaded successfully onto the machine. (Below is an optional step to save this data as a CSV locally.)

  • Some interesting variables noticed at first glance were crash_date, crash_time, and number_of_persons_injured

  • Finally, the New York dataset retrieved here is not as big as the Chicago one, but it is still rich with data and very much kept up to date.

Optional data storing step: Since Dataset 1 was loaded manually, that dataset is already saved locally. Dataset 2 was loaded from the URL, so we will save that raw dataset locally now.

In [4]:
# Optional: store the dataset locally from the downloaded JSON data
# (os.path.join avoids backslash-escape issues in the file path)
csv_path = os.path.join('datasets', 'dataset_two.csv')
newyork_rawdata_json.to_csv(csv_path, index=False)

# Reload the saved CSV file of raw data
# This is now the raw data file
newyork_rawdata = pd.read_csv(csv_path)

print('Successfully created local backup for the New York data!')
Successfully created local backup for the New York data!

Within the local datasets folder we now have our two raw datasets.

2. Assess data¶

Assess the data according to data quality and tidiness metrics using the report below.

List two data quality issues and two tidiness issues. Assess each data issue visually and programmatically, then briefly describe the issue you find. Make sure you include justifications for the methods you use for the assessment.

Quality Issue 1: Checking for Missing Data and Dropping Unnecessary Columns¶

Checking Issue 1 Visually¶

We will first inspect the two datasets by plotting visualizations, addressing the first quality issue: how much missing data is present in each of the two datasets.

In [6]:
# Inspecting the Chicago dataframe visually: bar chart of non-null counts
msn.bar(chicago_rawdata)
Out[6]:
<Axes: >
In [23]:
# Inspecting the New York dataframe visually: bar chart of non-null counts
msn.bar(newyork_rawdata)
Out[23]:
<Axes: >

We can see the missing data in each dataset in the two charts above: the bars reveal which columns have missing rows of data.

Quality Issue 1 Programmatically¶

Here we are going to check two things:¶
  1. The total sum of null values in each column
  2. The info of the datasets, such as non-null counts, data types, and column lengths. This helps determine which columns to drop
In [6]:
# Using isnull().sum() to see the number of missing values in each column
chicago_rawdata.isnull().sum()
Out[6]:
CRASH_RECORD_ID                       0
RD_NO                              4016
CRASH_DATE_EST_I                 654628
CRASH_DATE                            0
POSTED_SPEED_LIMIT                    0
TRAFFIC_CONTROL_DEVICE                0
DEVICE_CONDITION                      0
WEATHER_CONDITION                     0
LIGHTING_CONDITION                    0
FIRST_CRASH_TYPE                      0
TRAFFICWAY_TYPE                       0
LANE_CNT                         509267
ALIGNMENT                             0
ROADWAY_SURFACE_COND                  0
ROAD_DEFECT                           0
REPORT_TYPE                       19672
CRASH_TYPE                            0
INTERSECTION_RELATED_I           545779
NOT_RIGHT_OF_WAY_I               675089
HIT_AND_RUN_I                    488097
DAMAGE                                0
DATE_POLICE_NOTIFIED                  0
PRIM_CONTRIBUTORY_CAUSE               0
SEC_CONTRIBUTORY_CAUSE                0
STREET_NO                             0
STREET_DIRECTION                      4
STREET_NAME                           1
BEAT_OF_OCCURRENCE                    5
PHOTOS_TAKEN_I                   699447
STATEMENTS_TAKEN_I               693339
DOORING_I                        706117
WORK_ZONE_I                      704150
WORK_ZONE_TYPE                   705048
WORKERS_PRESENT_I                707204
NUM_UNITS                             0
MOST_SEVERE_INJURY                 1541
INJURIES_TOTAL                     1530
INJURIES_FATAL                     1530
INJURIES_INCAPACITATING            1530
INJURIES_NON_INCAPACITATING        1530
INJURIES_REPORTED_NOT_EVIDENT      1530
INJURIES_NO_INDICATION             1530
INJURIES_UNKNOWN                   1530
CRASH_HOUR                            0
CRASH_DAY_OF_WEEK                     0
CRASH_MONTH                           0
LATITUDE                           4553
LONGITUDE                          4553
LOCATION                           4553
dtype: int64
In [22]:
chicago_rawdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 708268 entries, 0 to 708267
Data columns (total 49 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH_RECORD_ID                708268 non-null  object 
 1   RD_NO                          704252 non-null  object 
 2   CRASH_DATE_EST_I               53640 non-null   object 
 3   CRASH_DATE                     708268 non-null  object 
 4   POSTED_SPEED_LIMIT             708268 non-null  int64  
 5   TRAFFIC_CONTROL_DEVICE         708268 non-null  object 
 6   DEVICE_CONDITION               708268 non-null  object 
 7   WEATHER_CONDITION              708268 non-null  object 
 8   LIGHTING_CONDITION             708268 non-null  object 
 9   FIRST_CRASH_TYPE               708268 non-null  object 
 10  TRAFFICWAY_TYPE                708268 non-null  object 
 11  LANE_CNT                       199001 non-null  float64
 12  ALIGNMENT                      708268 non-null  object 
 13  ROADWAY_SURFACE_COND           708268 non-null  object 
 14  ROAD_DEFECT                    708268 non-null  object 
 15  REPORT_TYPE                    688596 non-null  object 
 16  CRASH_TYPE                     708268 non-null  object 
 17  INTERSECTION_RELATED_I         162489 non-null  object 
 18  NOT_RIGHT_OF_WAY_I             33179 non-null   object 
 19  HIT_AND_RUN_I                  220171 non-null  object 
 20  DAMAGE                         708268 non-null  object 
 21  DATE_POLICE_NOTIFIED           708268 non-null  object 
 22  PRIM_CONTRIBUTORY_CAUSE        708268 non-null  object 
 23  SEC_CONTRIBUTORY_CAUSE         708268 non-null  object 
 24  STREET_NO                      708268 non-null  int64  
 25  STREET_DIRECTION               708264 non-null  object 
 26  STREET_NAME                    708267 non-null  object 
 27  BEAT_OF_OCCURRENCE             708263 non-null  float64
 28  PHOTOS_TAKEN_I                 8821 non-null    object 
 29  STATEMENTS_TAKEN_I             14929 non-null   object 
 30  DOORING_I                      2151 non-null    object 
 31  WORK_ZONE_I                    4118 non-null    object 
 32  WORK_ZONE_TYPE                 3220 non-null    object 
 33  WORKERS_PRESENT_I              1064 non-null    object 
 34  NUM_UNITS                      708268 non-null  int64  
 35  MOST_SEVERE_INJURY             706727 non-null  object 
 36  INJURIES_TOTAL                 706738 non-null  float64
 37  INJURIES_FATAL                 706738 non-null  float64
 38  INJURIES_INCAPACITATING        706738 non-null  float64
 39  INJURIES_NON_INCAPACITATING    706738 non-null  float64
 40  INJURIES_REPORTED_NOT_EVIDENT  706738 non-null  float64
 41  INJURIES_NO_INDICATION         706738 non-null  float64
 42  INJURIES_UNKNOWN               706738 non-null  float64
 43  CRASH_HOUR                     708268 non-null  int64  
 44  CRASH_DAY_OF_WEEK              708268 non-null  int64  
 45  CRASH_MONTH                    708268 non-null  int64  
 46  LATITUDE                       703715 non-null  float64
 47  LONGITUDE                      703715 non-null  float64
 48  LOCATION                       703715 non-null  object 
dtypes: float64(11), int64(6), object(32)
memory usage: 264.8+ MB

Here we will create the threshold. We cannot depend on magic numbers or bias, so we use a proportionate cutoff of 25%. Below we compute 25% of the total row count of the Chicago dataset; the output is our threshold. Any column with a number of NaN values exceeding that threshold will be dropped!

In [46]:
# For the Chicago data threshold: 25% of the total row count
chicago_threshold = int(0.25 * len(chicago_rawdata))  # 708268 rows -> 177067

print(f'Any column with {chicago_threshold} or more missing values will be dropped from the Chicago dataset.')
Any column with 177067 or more missing values will be dropped from the Chicago dataset.

Now for the New York data...

In [54]:
newyork_rawdata.isnull().sum()
Out[54]:
crash_date                         0
crash_time                         0
on_street_name                   258
off_street_name                  549
number_of_persons_injured          0
number_of_persons_killed           0
number_of_pedestrians_injured      0
number_of_pedestrians_killed       0
number_of_cyclist_injured          0
number_of_cyclist_killed           0
number_of_motorist_injured         0
number_of_motorist_killed          0
contributing_factor_vehicle_1      3
contributing_factor_vehicle_2    230
collision_id                       0
vehicle_type_code1                14
vehicle_type_code2               345
borough                          367
zip_code                         368
latitude                          76
longitude                         76
location                          76
cross_street_name                742
contributing_factor_vehicle_3    909
vehicle_type_code_3              919
contributing_factor_vehicle_4    973
vehicle_type_code_4              976
contributing_factor_vehicle_5    994
vehicle_type_code_5              995
dtype: int64
In [23]:
newyork_rawdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   crash_date                     1000 non-null   object 
 1   crash_time                     1000 non-null   object 
 2   on_street_name                 742 non-null    object 
 3   off_street_name                451 non-null    object 
 4   number_of_persons_injured      1000 non-null   int64  
 5   number_of_persons_killed       1000 non-null   int64  
 6   number_of_pedestrians_injured  1000 non-null   int64  
 7   number_of_pedestrians_killed   1000 non-null   int64  
 8   number_of_cyclist_injured      1000 non-null   int64  
 9   number_of_cyclist_killed       1000 non-null   int64  
 10  number_of_motorist_injured     1000 non-null   int64  
 11  number_of_motorist_killed      1000 non-null   int64  
 12  contributing_factor_vehicle_1  997 non-null    object 
 13  contributing_factor_vehicle_2  770 non-null    object 
 14  collision_id                   1000 non-null   int64  
 15  vehicle_type_code1             986 non-null    object 
 16  vehicle_type_code2             655 non-null    object 
 17  borough                        633 non-null    object 
 18  zip_code                       632 non-null    float64
 19  latitude                       924 non-null    float64
 20  longitude                      924 non-null    float64
 21  location                       924 non-null    object 
 22  cross_street_name              258 non-null    object 
 23  contributing_factor_vehicle_3  91 non-null     object 
 24  vehicle_type_code_3            81 non-null     object 
 25  contributing_factor_vehicle_4  27 non-null     object 
 26  vehicle_type_code_4            24 non-null     object 
 27  contributing_factor_vehicle_5  6 non-null      object 
 28  vehicle_type_code_5            5 non-null      object 
dtypes: float64(3), int64(9), object(17)
memory usage: 226.7+ KB

As was done for the Chicago dataset above, the same is done here with the New York data: find the 25% threshold used to determine which columns with too many NaN values will be dropped.

In [47]:
# For the New York data threshold: 25% of the total row count
newyork_threshold = int(0.25 * len(newyork_rawdata))  # 1000 rows -> 250

print(f'Any column with more than {newyork_threshold} missing values will be dropped from the New York dataset.')
Any column with more than 250 missing values will be dropped from the New York dataset.

Summary of Issue and Justification for Quality Issue One¶

  • Above, we plotted visualizations for both datasets showing the missing data, and then programmatically analyzed the missing values in each column.

  • The biggest question here, to avoid bias, is: "How should the data and columns be dropped?" in the most logical way possible.

  • Not all columns will be used, because some columns have so much missing data that we deem them unnecessary.

  • Using info() showed us the range index, which gives the number of rows in each column. For the Chicago data, the figure 708268 means there are 708268 rows in each column; likewise, the New York data has 1000 rows in each column.

  • From this information we derive a threshold: if more than 25% of a column's values are missing (NaN), the column is dropped completely. The code above computes the threshold numbers, which avoids magic numbers.

  • Finally, the columns that are kept will be considered useful and have their remaining missing rows dropped using dropna()
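This strategy can be sketched on a tiny hypothetical frame (not the real data; column names here are made up for illustration): columns whose missing share exceeds 25% are dropped first, then the remaining rows containing NaN values.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the real datasets
df = pd.DataFrame({
    'useful':         [1.0, 2.0, np.nan, 4.0],       # 25% missing -> kept
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],  # 75% missing -> dropped
})

threshold = 0.25
# Columns whose share of NaN values exceeds the threshold
cols_to_drop = df.columns[df.isnull().mean() > threshold]

# Drop those columns, then drop remaining rows containing NaN
cleaned = df.drop(columns=cols_to_drop).dropna()
```

Using `isnull().mean()` expresses the missing share as a fraction, so the same code works for both datasets regardless of their row counts.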

Quality Issue 2: The two datasets cannot be distinguished by city. There is no city name within either dataset.¶

Let us think ahead for a moment. If we were to merge the two datasets right now, how would we distinguish the data? For example, how would we know which crash happened in New York and which in Chicago? Yes, we have the longitude and latitude, but there is no field identifying the city. A reader of the dataset would have no idea which city a record is from. This is a big quality issue and needs to be attended to.

In [22]:
# Inspecting the dataframes visually via missing-data heatmaps

# First, the Chicago data
msn.heatmap(chicago_rawdata, cmap='YlGnBu')
Out[22]:
<Axes: >
In [24]:
# Visualizing the New York city columns via heatmap
msn.heatmap(newyork_rawdata, cmap='YlGnBu')
Out[24]:
<Axes: >
Quality Issue 2 Programmatically¶

We have been looking at the two datasets long enough to notice that there is no mention of a city name anywhere. Even in the visualizations above, we do not see any column pertaining to or mentioning a city name. So how can the datasets be distinguished by city? We will have to create a column to help with this.

Below is a programmatic test example, using a test dataframe, of how the column will be created; it sets the expectation for what will be done to the Chicago and New York datasets respectively.

In [5]:
# Creating a test dataframe to demonstrate programmatically

test_data_1 = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'red', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
test_df_1 = pd.DataFrame(test_data_1, index = ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'])


# Creating/Adding the new column as a test
test_df_1['City Name'] = 'City Name Here'
test_df_1
Out[5]:
name age favorite_color grade City Name
Willard Morris Willard Morris 20 blue 88 City Name Here
Al Jennings Al Jennings 19 red 92 City Name Here
Omar Mullins Omar Mullins 22 yellow 95 City Name Here
Spencer McDaniel Spencer McDaniel 21 green 70 City Name Here

As seen, there is a constant value in the new test column City Name. A constant value is required for the analysis.

Remember, this is just an example!

Summary of Issue and Justification of Quality Issue 2¶

  • From the heatmaps above we can see the column names clearly. Anyone observing the data would not know whether the crash data was from New York City or Chicago.

  • Taking it further and checking the lists of column names in the datasets, there is no indication of the city name.

  • The solution is to add a new column with a constant value. The column could be named City and explicitly state the city name, holding New York and Chicago respectively.
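The planned fix can be sketched with throwaway frames standing in for the real data (the column and value names below are illustrative):

```python
import pandas as pd

# Throwaway stand-ins for the two raw datasets
chicago_sketch = pd.DataFrame({'injuries': [1, 0]})
newyork_sketch = pd.DataFrame({'injuries': [2]})

# Tag each frame with a constant city name before any merge
chicago_sketch['City'] = 'Chicago'
newyork_sketch['City'] = 'New York'

# After concatenation, every row is still traceable to its city
combined = pd.concat([chicago_sketch, newyork_sketch], ignore_index=True)
```

Because the tag is added before concatenation, every row in the combined frame keeps its origin, which is exactly what the merged analysis will need.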

Tidiness Issue 1: The CRASH_DATE column in the Chicago dataset contains both date and time. Additionally, the New York dataset has two date/time columns, but one is unclear.¶

NOTE: The objective here is to keep the date separate from the time. Having both date and time in the same column is confusing.

We will first display the CRASH_DATE column in a visualization (boxplot) to get a visual sense of how the data actually looks.

In [6]:
# Inspecting the data visually
# Seeing how the CRASH_DATE values look on the axis of a visualization
fig1 = px.box(chicago_rawdata, y='CRASH_DATE')

fig1.show()
Tidiness Issue 1 Programmatically¶

Here is how the CRASH_DATE column looks in the Chicago dataset:

In [20]:
chicago_rawdata[['CRASH_DATE']].head()
Out[20]:
CRASH_DATE
0 03/25/2019 02:43:00 PM
1 09/05/2018 08:40:00 AM
2 07/15/2022 12:45:00 AM
3 07/15/2022 06:50:00 PM
4 06/03/2018 05:00:00 PM

The goal here is to split the date and time into their own columns for tidiness purposes. Now let's look at the New York data below.

The New York dataset is tidier in this respect and shows what we want for the Chicago dataset: it already has two separate columns, as seen below:

In [16]:
newyork_rawdata[['crash_date', 'crash_time']].head()
Out[16]:
crash_date crash_time
0 2021-09-11T00:00:00.000 2023-04-09 02:39:00
1 2022-03-26T00:00:00.000 2023-04-09 11:45:00
2 2022-06-29T00:00:00.000 2023-04-09 06:55:00
3 2021-09-11T00:00:00.000 2023-04-09 09:35:00
4 2021-12-14T00:00:00.000 2023-04-09 08:13:00

As seen above, New York has two columns, crash_date and crash_time, respectively. However, there are two sets of dates. Since the crash_time column holds the valid time stamps, we will consider crash_time the valid time column and perhaps drop the other column.

NOTE: THIS IS A PROGRAMMATIC EXAMPLE OF WHAT TO EXPECT:¶

(Not using the Chicago data, just a test dataframe for now)

In [21]:
# Setting up a test dataframe
test_df_2 = pd.DataFrame({'my_timestamp': pd.date_range('2016-1-1 15:00', periods=5)})

test_df_2['new_date'] = [d.date() for d in test_df_2['my_timestamp']]
test_df_2['new_time'] = [d.time() for d in test_df_2['my_timestamp']]

test_df_2
Out[21]:
my_timestamp new_date new_time
0 2016-01-01 15:00:00 2016-01-01 15:00:00
1 2016-01-02 15:00:00 2016-01-02 15:00:00
2 2016-01-03 15:00:00 2016-01-03 15:00:00
3 2016-01-04 15:00:00 2016-01-04 15:00:00
4 2016-01-05 15:00:00 2016-01-05 15:00:00

This is how we want to separate the date and time columns in the New York and Chicago data respectively. This step will be carried out on the dataframes later.

Summary of Issue and Justification of Tidiness Issue 1¶

  • In the visualization above, the CRASH_DATE variable from the Chicago dataset appears on the y-axis, showing how the combined date-and-time values look. The objective is to separate them, creating new columns for date and time.

  • Next, we programmatically inspected the New York data and saw that crash_date and crash_time are already separate columns.

  • However, one of the New York columns has dates but no meaningful times, while crash_time carries the time stamps. So we will drop crash_date.

  • The solution for the Chicago data is to separate the date and time: the date like 01/13/2018 and the time like 02:43:00 PM, placed in two different columns respectively.
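The planned split for the Chicago data can be sketched as follows, on a two-row sample mirroring the CRASH_DATE format (the new column names are placeholders, not final):

```python
import pandas as pd

# Sample rows mirroring Chicago's CRASH_DATE format
sample = pd.DataFrame({'CRASH_DATE': ['03/25/2019 02:43:00 PM',
                                      '09/05/2018 08:40:00 AM']})

# Parse with the matching format string, then split into
# separate date and time columns
parsed = pd.to_datetime(sample['CRASH_DATE'], format='%m/%d/%Y %I:%M:%S %p')
sample['crash_date'] = parsed.dt.date
sample['crash_time'] = parsed.dt.time
```

Passing an explicit format avoids pandas having to guess the 12-hour AM/PM layout, which keeps the parse fast and unambiguous.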

Tidiness Issue 2: Identifying and renaming columns in each dataset that hold the same data¶

Pause for a moment: if we were to merge our datasets right now, we would have more columns to deal with and more repetitive work to do. Data across two datasets is often more similar than it seems; sometimes it is just the column names that throw us off. This is an indirect tidiness issue that comes from working with two datasets, and it is our job as analysts to fix it!

While working with the datasets, we noticed that many columns in each are similar or comprise similar data. For example, in Tidiness Issue 1 above we saw the two columns crash_date and CRASH_DATE: both state the same thing, but they are named differently. Let us look below.

In [30]:
# Inspecting the two date columns programmatically, starting with Chicago
chicago_rawdata['CRASH_DATE'].head()
Out[30]:
0    03/25/2019 02:43:00 PM
1    09/05/2018 08:40:00 AM
2    07/15/2022 12:45:00 AM
3    07/15/2022 06:50:00 PM
4    06/03/2018 05:00:00 PM
Name: CRASH_DATE, dtype: object
In [31]:
newyork_rawdata['crash_date'].head()
Out[31]:
0    2021-09-11T00:00:00.000
1    2022-03-26T00:00:00.000
2    2022-06-29T00:00:00.000
3    2021-09-11T00:00:00.000
4    2021-12-14T00:00:00.000
Name: crash_date, dtype: object

As stated, both columns hold date/time data. However, the inconsistency that poses an issue here is the column names. There are more examples like this in each dataset:

In [32]:
chicago_rawdata.columns
Out[32]:
Index(['CRASH_RECORD_ID', 'RD_NO', 'CRASH_DATE_EST_I', 'CRASH_DATE',
       'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
       'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I',
       'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LATITUDE', 'LONGITUDE', 'LOCATION'],
      dtype='object')
In [33]:
newyork_rawdata.columns
Out[33]:
Index(['crash_date', 'crash_time', 'on_street_name', 'off_street_name',
       'number_of_persons_injured', 'number_of_persons_killed',
       'number_of_pedestrians_injured', 'number_of_pedestrians_killed',
       'number_of_cyclist_injured', 'number_of_cyclist_killed',
       'number_of_motorist_injured', 'number_of_motorist_killed',
       'contributing_factor_vehicle_1', 'contributing_factor_vehicle_2',
       'collision_id', 'vehicle_type_code1', 'vehicle_type_code2', 'borough',
       'zip_code', 'latitude', 'longitude', 'location', 'cross_street_name',
       'contributing_factor_vehicle_3', 'vehicle_type_code_3',
       'contributing_factor_vehicle_4', 'vehicle_type_code_4',
       'contributing_factor_vehicle_5', 'vehicle_type_code_5'],
      dtype='object')
In [36]:
columns_to_rename_chicago = chicago_rawdata[['CRASH_RECORD_ID', 'INJURIES_TOTAL', 'LATITUDE', 'LONGITUDE', 'LOCATION']]
columns_to_rename_chicago.head()
Out[36]:
CRASH_RECORD_ID INJURIES_TOTAL LATITUDE LONGITUDE LOCATION
0 79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4... 1.0 41.884547 -87.641201 POINT (-87.64120093714 41.884547224337)
1 792b539deaaad65ee5b4a9691d927a34d298eb33d42af0... 0.0 41.968562 -87.740659 POINT (-87.740659314632 41.968562453871)
2 0115ade9a755e835255508463f7e9c4a9a0b47e9304238... 0.0 41.886336 -87.716203 POINT (-87.716203130599 41.886336409761)
3 017040c61958d2fa977c956b2bd2d6759ef7754496dc96... 0.0 41.925111 -87.667997 POINT (-87.667997321599 41.925110815832)
4 78eee027ec3dcc85d36c9e3fdae4729dcc56440105d65b... 0.0 41.910758 -87.731389 POINT (-87.731388754145 41.910757551599)
In [35]:
columns_to_rename_newyork = newyork_rawdata[['collision_id', 'number_of_persons_injured', 'latitude', 'longitude', 'location']]
columns_to_rename_newyork.head()
Out[35]:
collision_id number_of_persons_injured latitude longitude location
0 4455765 2 NaN NaN NaN
1 4513547 1 NaN NaN NaN
2 4541903 0 NaN NaN NaN
3 4456314 0 40.667202 -73.866500 {'latitude': '40.667202', 'longitude': '-73.86...
4 4486609 0 40.683304 -73.917274 {'latitude': '40.683304', 'longitude': '-73.91...

From the two column lists above, it can be seen that several columns carry the same information; only the column names differentiate them. Thinking ahead, we will rename each of these columns to a shared name. For example:

The following columns will be renamed as follows in each dataset respectively:

CRASH_RECORD_ID and collision_id will be renamed to crash_collision_id

INJURIES_TOTAL and number_of_persons_injured will be renamed to total_persons_injured

LATITUDE and latitude will be renamed to latitude

LONGITUDE and longitude will be renamed to longitude

LOCATION and location will be renamed to location

INJURIES_FATAL and number_of_persons_killed will be renamed to killed_by_crash

Summary of issue and justification of Tidiness Issue 2¶
  • As said earlier, we notice that some columns in each of the datasets carry the same meaning; however, the column names are not the same.

  • Secondly, the capitalization is inconsistent: the Chicago dataset uses upper-case column names while the New York dataset uses lower-case names. We will stick to the lower-case format to keep both datasets consistent.

  • Thirdly, we outlined the columns in each dataset that hold the same kind of data. As said in the point above, lower-case names will be used, so we will rename those columns and change their capitalization.

  • Finally, all columns will eventually be converted to lower case when renamed.
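The renaming plan above can be sketched with one mapping dictionary per dataset, both converging on the same shared names. The tiny frames below are hypothetical stand-ins for the real data, with placeholder values:

```python
import pandas as pd

# Hypothetical toy frames standing in for the two datasets (placeholder values)
chicago = pd.DataFrame({'CRASH_RECORD_ID': ['abc'],
                        'INJURIES_TOTAL': [1.0],
                        'INJURIES_FATAL': [0.0]})
newyork = pd.DataFrame({'collision_id': [4455765],
                        'number_of_persons_injured': [2],
                        'number_of_persons_killed': [0]})

# One mapping per dataset, both converging on the shared lower-case names
chicago_map = {'CRASH_RECORD_ID': 'crash_collision_id',
               'INJURIES_TOTAL': 'total_persons_injured',
               'INJURIES_FATAL': 'killed_by_crash'}
newyork_map = {'collision_id': 'crash_collision_id',
               'number_of_persons_injured': 'total_persons_injured',
               'number_of_persons_killed': 'killed_by_crash'}

chicago = chicago.rename(columns=chicago_map)
newyork = newyork.rename(columns=newyork_map)

# After renaming, both frames expose identical column names
assert list(chicago.columns) == list(newyork.columns)
```

Keeping the two mappings side by side makes it easy to check that every shared concept ends up under exactly one name.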

3. Clean data¶

Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. Make sure you include justifications for your cleaning decisions.

After the cleaning for each issue, please use either a visual or a programmatic method to validate that the cleaning was successful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have at least 4 variables after combining the data.

Before we proceed, we must make copies of our original datasets to avoid modifying the raw data!¶

In [5]:
# FILL IN - Make copies of the datasets to ensure the raw dataframes 
# are not impacted

# Copy dataset one which is chicago data
chicago_data_wrangled = chicago_rawdata.copy()

# Copy dataset two which is new york data 
newyork_data_wrangled = newyork_rawdata.copy()

Quality Issue 1: Removing missing data¶

Here we will clean both datasets by dropping the unnecessary columns and then removing the rows with missing values. Pandas provides two methods to do this:

  • drop() method, which drops a column entirely.
  • dropna() method, which removes the rows with missing values from each dataset.

REMINDER: Remember that in the assessment section we created thresholds... When running isnull().sum() we saw the missing-value counts. Here we simply looked through that output and selected each column not falling within the threshold for each dataset respectively.
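The threshold check described above can also be done programmatically rather than by reading the isnull().sum() output by eye. A minimal sketch on a toy frame, assuming a 50% missing-value cutoff (the project's actual threshold may differ):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one of the raw datasets
df = pd.DataFrame({'keep_me': [1, 2, 3, 4],
                   'mostly_missing': [np.nan, np.nan, np.nan, 1.0]})

# Share of missing values per column
missing_share = df.isnull().sum() / len(df)

# Columns whose missing share exceeds the cutoff get dropped;
# 0.5 is an assumed cutoff here, not necessarily the project's actual threshold
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.tolist()
df_cleaned = df.drop(columns=cols_to_drop)

print(cols_to_drop)  # ['mostly_missing']
```

This keeps the column selection reproducible if the raw data is refreshed later.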

In [6]:
# Dropping columns from the chicago dataset
chicago_data_wrangled = chicago_data_wrangled.drop(['RD_NO', 'CRASH_DATE_EST_I', 'LANE_CNT', 'INTERSECTION_RELATED_I',
                                                   'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I',
                                                   'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
                                                   'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I'], axis=1)



# Dropping columns from the new york dataset

newyork_data_wrangled = newyork_data_wrangled.drop(['on_street_name', 'off_street_name', 'vehicle_type_code2', 'borough', 
                                                    'zip_code', 'cross_street_name', 'contributing_factor_vehicle_3',
                                                    'vehicle_type_code_3', 'contributing_factor_vehicle_4', 
                                                    'vehicle_type_code_4', 'contributing_factor_vehicle_5',
                                                   'vehicle_type_code_5'], axis=1)


print('Columns successfully dropped!')
Columns successfully dropped!

Now that we have dropped the columns, let us drop the rows with missing values.

In [9]:
# Dropping missing rows for chicago data
chicago_data_wrangled = chicago_data_wrangled.dropna()

# Dropping missing rows for New york data
newyork_data_wrangled = newyork_data_wrangled.dropna()

print('Dropped Missing values!')
Dropped Missing values!
Validating if cleaning was successful...¶
In [10]:
chicago_data_wrangled.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 682625 entries, 0 to 708267
Data columns (total 37 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH_RECORD_ID                682625 non-null  object 
 1   CRASH_DATE                     682625 non-null  object 
 2   POSTED_SPEED_LIMIT             682625 non-null  int64  
 3   TRAFFIC_CONTROL_DEVICE         682625 non-null  object 
 4   DEVICE_CONDITION               682625 non-null  object 
 5   WEATHER_CONDITION              682625 non-null  object 
 6   LIGHTING_CONDITION             682625 non-null  object 
 7   FIRST_CRASH_TYPE               682625 non-null  object 
 8   TRAFFICWAY_TYPE                682625 non-null  object 
 9   ALIGNMENT                      682625 non-null  object 
 10  ROADWAY_SURFACE_COND           682625 non-null  object 
 11  ROAD_DEFECT                    682625 non-null  object 
 12  REPORT_TYPE                    682625 non-null  object 
 13  CRASH_TYPE                     682625 non-null  object 
 14  DAMAGE                         682625 non-null  object 
 15  DATE_POLICE_NOTIFIED           682625 non-null  object 
 16  PRIM_CONTRIBUTORY_CAUSE        682625 non-null  object 
 17  SEC_CONTRIBUTORY_CAUSE         682625 non-null  object 
 18  STREET_NO                      682625 non-null  int64  
 19  STREET_DIRECTION               682625 non-null  object 
 20  STREET_NAME                    682625 non-null  object 
 21  BEAT_OF_OCCURRENCE             682625 non-null  float64
 22  NUM_UNITS                      682625 non-null  int64  
 23  MOST_SEVERE_INJURY             682625 non-null  object 
 24  INJURIES_TOTAL                 682625 non-null  float64
 25  INJURIES_FATAL                 682625 non-null  float64
 26  INJURIES_INCAPACITATING        682625 non-null  float64
 27  INJURIES_NON_INCAPACITATING    682625 non-null  float64
 28  INJURIES_REPORTED_NOT_EVIDENT  682625 non-null  float64
 29  INJURIES_NO_INDICATION         682625 non-null  float64
 30  INJURIES_UNKNOWN               682625 non-null  float64
 31  CRASH_HOUR                     682625 non-null  int64  
 32  CRASH_DAY_OF_WEEK              682625 non-null  int64  
 33  CRASH_MONTH                    682625 non-null  int64  
 34  LATITUDE                       682625 non-null  float64
 35  LONGITUDE                      682625 non-null  float64
 36  LOCATION                       682625 non-null  object 
dtypes: float64(10), int64(6), object(21)
memory usage: 197.9+ MB
In [59]:
newyork_data_wrangled.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 711 entries, 6 to 997
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   crash_date                     711 non-null    object 
 1   crash_time                     711 non-null    object 
 2   number_of_persons_injured      711 non-null    int64  
 3   number_of_persons_killed       711 non-null    int64  
 4   number_of_pedestrians_injured  711 non-null    int64  
 5   number_of_pedestrians_killed   711 non-null    int64  
 6   number_of_cyclist_injured      711 non-null    int64  
 7   number_of_cyclist_killed       711 non-null    int64  
 8   number_of_motorist_injured     711 non-null    int64  
 9   number_of_motorist_killed      711 non-null    int64  
 10  contributing_factor_vehicle_1  711 non-null    object 
 11  contributing_factor_vehicle_2  711 non-null    object 
 12  collision_id                   711 non-null    int64  
 13  vehicle_type_code1             711 non-null    object 
 14  latitude                       711 non-null    float64
 15  longitude                      711 non-null    float64
 16  location                       711 non-null    object 
dtypes: float64(2), int64(9), object(6)
memory usage: 100.0+ KB
Justification:¶
  • We first identified the columns not falling within the thresholds and successfully dropped them using drop(). Before dropping columns, the Chicago dataset had 49 columns; after dropping it has 37. The New York dataset had 29 columns and now has 17 after dropping.
  • We used the Pandas method dropna() to drop the rows with missing values from both datasets.
  • We then used info() on both datasets again to validate that our cleaning strategy was successful. We now see clear differences: the Chicago dataset started with 708268 rows and now has 682625 rows, while the New York dataset went from 1000 rows to 711.

Quality Issue 2: Adding a default column with the city name¶

As mentioned in the assessment section, we need to add a column identifying the city. This column will hold a constant value, as mentioned. Below we will create a column called city_name in each of our datasets.

In [10]:
#FILL IN - Apply the cleaning strategy

# Creating the new default columns for our datasets

# First with chicago
chicago_data_wrangled['city_name'] = 'Chicago-City'

# Then the newyork dataset
newyork_data_wrangled['city_name'] = 'NewYork-City'

print('New column created!')
New column created!
Validating if the columns were successfully created in the dataframes respectively.¶
In [10]:
#FILL IN - Validate the cleaning was successful

# First chicago data
chicago_data_wrangled[['CRASH_DATE', 'city_name']].head()
Out[10]:
CRASH_DATE city_name
0 03/25/2019 02:43:00 PM Chicago-City
1 09/05/2018 08:40:00 AM Chicago-City
2 07/15/2022 12:45:00 AM Chicago-City
3 07/15/2022 06:50:00 PM Chicago-City
4 06/03/2018 05:00:00 PM Chicago-City
In [14]:
# Now newyork data
newyork_data_wrangled[['crash_date', 'city_name']].head()
Out[14]:
crash_date city_name
6 2021-12-14T00:00:00.000 NewYork-City
7 2021-12-14T00:00:00.000 NewYork-City
8 2021-12-14T00:00:00.000 NewYork-City
9 2021-12-14T00:00:00.000 NewYork-City
10 2021-12-13T00:00:00.000 NewYork-City
Justification:¶
  • The new column city_name was successfully created in both the chicago_data_wrangled and newyork_data_wrangled datasets respectively.

  • When validating the outputs, we see that the city_name column in the Chicago dataset outputs Chicago-City.

  • Finally, while validating the output for New York, we see city_name output NewYork-City.

Tidiness Issue 1: Splitting the CRASH_DATE column in the Chicago dataset into separate date and time columns. In the New York dataset, dropping the incomplete date column crash_date and then splitting the crash_time column into date and time.¶

Starting with Chicago... let us first convert the column to the datetime data type, since we need to split it into two new columns.

In [11]:
# First convert the CRASH DATE column to date/time format 
chicago_data_wrangled['CRASH_DATE'] = pd.to_datetime(chicago_data_wrangled['CRASH_DATE'])
print ('Successfully converted the `CRASH_DATE` to date/time format!')
Successfully converted the `CRASH_DATE` to date/time format!

Below we can go ahead and split the CRASH_DATE column to separate the date and time into two new columns.

In [14]:
#FILL IN - Apply the cleaning strategy

# CHicago dataset
chicago_data_wrangled['crash_date'] = [d.date() for d in chicago_data_wrangled['CRASH_DATE']]
chicago_data_wrangled['crash_time'] = [d.time() for d in chicago_data_wrangled['CRASH_DATE']]

print ('Successfully split `CRASH_DATE` into two new columns `crash_date` and `crash_time`!')
Successfully split `CRASH_DATE` into two new columns `crash_date` and `crash_time`!

Now let's work on the New York data...

First we drop the incomplete date column crash_date

In [7]:
newyork_data_wrangled = newyork_data_wrangled.drop(['crash_date'], axis=1)
print('Successfully dropped crash_date column!')
Successfully dropped crash_date column!

Here, we will do the same as above: split the time and date into two separate columns.

In [8]:
newyork_data_wrangled['crash_time'] = pd.to_datetime(newyork_data_wrangled['crash_time'])
print ('Successfully converted column to date/time format')
Successfully converted column to date/time format
In [9]:
newyork_data_wrangled['crash_date'] = [d.date() for d in newyork_data_wrangled['crash_time']]
newyork_data_wrangled['crash_time'] = [d.time() for d in newyork_data_wrangled['crash_time']]

print ('Successfully split the`crash_time` column into two new columns `crash_date` and `crash_time`!')
Successfully split the`crash_time` column into two new columns `crash_date` and `crash_time`!
Validating¶

First we can validate that the CRASH_DATE column from chicago was successfully split and formed two new columns crash_date and crash_time.

In [17]:
#FILL IN - Validate the cleaning was successful
chicago_data_wrangled[['CRASH_DATE', 'crash_date', 'crash_time']].head()
Out[17]:
CRASH_DATE crash_date crash_time
0 2019-03-25 14:43:00 2019-03-25 14:43:00
1 2018-09-05 08:40:00 2018-09-05 08:40:00
2 2022-07-15 00:45:00 2022-07-15 00:45:00
3 2022-07-15 18:50:00 2022-07-15 18:50:00
4 2018-06-03 17:00:00 2018-06-03 17:00:00

Second we can look at the New york data to see the respective columns split.

In [45]:
newyork_data_wrangled[['crash_date', 'crash_time']].head()
Out[45]:
crash_date crash_time
6 2023-04-09 17:05:00
7 2023-04-09 08:17:00
8 2023-04-09 21:10:00
9 2023-04-09 14:58:00
10 2023-04-09 00:34:00
Justification:¶
  • The first step was to convert the CRASH_DATE column to the pandas datetime format
  • Secondly, we split CRASH_DATE from the Chicago dataset into the new columns crash_date and crash_time
  • Thirdly, we dropped the existing crash_date column from the New York dataset because its data seemed incomplete
  • Fourthly, we recreated the crash_date column by properly splitting the date and time out of the crash_time column
  • Finally, both datasets now have crash_date and crash_time columns.
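As a side note, pandas offers a vectorized alternative to the list comprehensions used above: the .dt accessor extracts the date and time parts directly. A minimal sketch on a toy frame standing in for either dataset:

```python
import pandas as pd

# Toy frame with a single timestamp column (placeholder values)
df = pd.DataFrame({'CRASH_DATE': pd.to_datetime(['2019-03-25 14:43:00',
                                                 '2018-09-05 08:40:00'])})

# The .dt accessor splits date and time without a Python-level loop
df['crash_date'] = df['CRASH_DATE'].dt.date
df['crash_time'] = df['CRASH_DATE'].dt.time

print(df[['crash_date', 'crash_time']].head())
```

On frames the size of the Chicago data (hundreds of thousands of rows), avoiding the per-row Python loop is noticeably faster.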

Tidiness Issue 2: Renaming columns in both datasets that hold the same data under different names.¶

In the assessment section we outlined the columns that will be renamed in each dataframe. Each of these columns holds the same kind of data in its own dataframe; the only difference is the column name and heading. We can fix this issue to make the column names consistent and tidy.

In [48]:
#FILL IN - Apply the cleaning strategy

# Starting with chicago data.
chicago_data_wrangled = chicago_data_wrangled.rename({'CRASH_RECORD_ID': 'crash_collision_id',
                                                      'INJURIES_TOTAL': 'total_persons_injured',
                                                     'LATITUDE': 'latitude', 'LONGITUDE': 'longitude',
                                                     'LOCATION': 'location', 'INJURIES_FATAL': 'killed_by_crash'}, axis=1)


print ('Renaming the columns complete!')
Renaming the columns complete!
In [55]:
# latitude, longitude, and location are already lower case in the New York
# data, so only the remaining three columns need renaming
newyork_data_wrangled = newyork_data_wrangled.rename({'collision_id': 'crash_collision_id',
                                                      'number_of_persons_injured': 'total_persons_injured',
                                                      'number_of_persons_killed': 'killed_by_crash'}, axis=1)


print ('Renaming the columns complete!')
Renaming the columns complete!

With that, we have renamed the necessary columns in both datasets.

Validating:¶

I will simply print the columns of each wrangled dataset versus the original one.

In [56]:
#FILL IN - Validate the cleaning was successful

#first chicago

chicago_column_list = chicago_rawdata.columns



# Now the renamed list
chicago_renamed_list = chicago_data_wrangled.columns


print(chicago_column_list)
print (chicago_renamed_list)
Index(['CRASH_RECORD_ID', 'RD_NO', 'CRASH_DATE_EST_I', 'CRASH_DATE',
       'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
       'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I',
       'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LATITUDE', 'LONGITUDE', 'LOCATION'],
      dtype='object')
Index(['crash_collision_id', 'CRASH_DATE', 'POSTED_SPEED_LIMIT',
       'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION', 'WEATHER_CONDITION',
       'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE', 'TRAFFICWAY_TYPE',
       'ALIGNMENT', 'ROADWAY_SURFACE_COND', 'ROAD_DEFECT', 'REPORT_TYPE',
       'CRASH_TYPE', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'total_persons_injured', 'killed_by_crash',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'latitude', 'longitude', 'location', 'city_name', 'crash_date',
       'crash_time'],
      dtype='object')
In [57]:
#FILL IN - Validate the cleaning was successful

# Now new york

newyork_column_list = newyork_rawdata.columns



# Now the renamed list
newyork_renamed_list = newyork_data_wrangled.columns


print(newyork_column_list)
print (newyork_renamed_list)
Index(['crash_date', 'crash_time', 'on_street_name', 'off_street_name',
       'number_of_persons_injured', 'number_of_persons_killed',
       'number_of_pedestrians_injured', 'number_of_pedestrians_killed',
       'number_of_cyclist_injured', 'number_of_cyclist_killed',
       'number_of_motorist_injured', 'number_of_motorist_killed',
       'contributing_factor_vehicle_1', 'contributing_factor_vehicle_2',
       'collision_id', 'vehicle_type_code1', 'vehicle_type_code2', 'borough',
       'zip_code', 'latitude', 'longitude', 'location', 'cross_street_name',
       'contributing_factor_vehicle_3', 'vehicle_type_code_3',
       'contributing_factor_vehicle_4', 'vehicle_type_code_4',
       'contributing_factor_vehicle_5', 'vehicle_type_code_5'],
      dtype='object')
Index(['crash_time', 'total_persons_injured', 'killed_by_crash',
       'number_of_pedestrians_injured', 'number_of_pedestrians_killed',
       'number_of_cyclist_injured', 'number_of_cyclist_killed',
       'number_of_motorist_injured', 'number_of_motorist_killed',
       'contributing_factor_vehicle_1', 'contributing_factor_vehicle_2',
       'crash_collision_id', 'vehicle_type_code1', 'latitude', 'longitude',
       'location', 'city_name', 'crash_date'],
      dtype='object')

From the list comparisons above, you can see that the columns were successfully renamed!

Justification:¶
  • The columns were renamed using the pandas .rename() method in each dataset respectively

  • The wrangled dataset columns were compared to the raw dataset columns to observe the differences.

Remove unnecessary variables and combine datasets¶

We will not need all of the variables and columns, so here we manually select the required columns/variables from each dataset. We will create two new datasets, chicago_wrangled_complete and newyork_wrangled_complete, which will store the important variables going forward.

In [66]:
# Setting up new dataset with important variables
chicago_wrangled_complete = chicago_data_wrangled[['crash_collision_id', 'total_persons_injured', 'killed_by_crash',
                                                 'latitude', 'longitude', 'location', 'city_name', 'crash_date', 'crash_time']]

chicago_wrangled_complete.head()
Out[66]:
crash_collision_id total_persons_injured killed_by_crash latitude longitude location city_name crash_date crash_time
0 79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4... 1.0 0.0 41.884547 -87.641201 POINT (-87.64120093714 41.884547224337) Chicago-City 2019-03-25 14:43:00
1 792b539deaaad65ee5b4a9691d927a34d298eb33d42af0... 0.0 0.0 41.968562 -87.740659 POINT (-87.740659314632 41.968562453871) Chicago-City 2018-09-05 08:40:00
2 0115ade9a755e835255508463f7e9c4a9a0b47e9304238... 0.0 0.0 41.886336 -87.716203 POINT (-87.716203130599 41.886336409761) Chicago-City 2022-07-15 00:45:00
3 017040c61958d2fa977c956b2bd2d6759ef7754496dc96... 0.0 0.0 41.925111 -87.667997 POINT (-87.667997321599 41.925110815832) Chicago-City 2022-07-15 18:50:00
4 78eee027ec3dcc85d36c9e3fdae4729dcc56440105d65b... 0.0 0.0 41.910758 -87.731389 POINT (-87.731388754145 41.910757551599) Chicago-City 2018-06-03 17:00:00
In [67]:
newyork_wrangled_complete = newyork_data_wrangled[['crash_collision_id', 'total_persons_injured', 'killed_by_crash',
                                                   'latitude', 'longitude','location', 'city_name', 
                                                   'crash_date', 'crash_time']]

newyork_wrangled_complete.head()
Out[67]:
crash_collision_id total_persons_injured killed_by_crash latitude longitude location city_name crash_date crash_time
6 4486555 0 0 40.709183 -73.956825 {'latitude': '40.709183', 'longitude': '-73.95... NewYork-City 2023-04-09 17:05:00
7 4486660 2 0 40.868160 -73.831480 {'latitude': '40.86816', 'longitude': '-73.831... NewYork-City 2023-04-09 08:17:00
8 4487074 0 0 40.671720 -73.897100 {'latitude': '40.67172', 'longitude': '-73.897... NewYork-City 2023-04-09 21:10:00
9 4486519 0 0 40.751440 -73.973970 {'latitude': '40.75144', 'longitude': '-73.973... NewYork-City 2023-04-09 14:58:00
10 4486934 0 0 40.701275 -73.888870 {'latitude': '40.701275', 'longitude': '-73.88... NewYork-City 2023-04-09 00:34:00

We now have our two fully wrangled datasets, chicago_wrangled_complete and newyork_wrangled_complete, containing only the variables we wish to analyze. Next we will merge the two datasets using pd.concat to create one combined dataframe called chicago_newyork_crash_data

In [64]:
chicago_newyork_crash_data = pd.concat([chicago_wrangled_complete, newyork_wrangled_complete])

chicago_newyork_crash_data.head()
Out[64]:
crash_collision_id total_persons_injured killed_by_crash latitude longitude location city_name crash_date crash_time
0 79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4... 1.0 0.0 41.884547 -87.641201 POINT (-87.64120093714 41.884547224337) Chicago-City 2019-03-25 14:43:00
1 792b539deaaad65ee5b4a9691d927a34d298eb33d42af0... 0.0 0.0 41.968562 -87.740659 POINT (-87.740659314632 41.968562453871) Chicago-City 2018-09-05 08:40:00
2 0115ade9a755e835255508463f7e9c4a9a0b47e9304238... 0.0 0.0 41.886336 -87.716203 POINT (-87.716203130599 41.886336409761) Chicago-City 2022-07-15 00:45:00
3 017040c61958d2fa977c956b2bd2d6759ef7754496dc96... 0.0 0.0 41.925111 -87.667997 POINT (-87.667997321599 41.925110815832) Chicago-City 2022-07-15 18:50:00
4 78eee027ec3dcc85d36c9e3fdae4729dcc56440105d65b... 0.0 0.0 41.910758 -87.731389 POINT (-87.731388754145 41.910757551599) Chicago-City 2018-06-03 17:00:00
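One caveat with pd.concat as used above: each frame keeps its original row labels, so New York rows retain labels like 6 and 7 while Chicago rows start at 0, and labels can collide between the two cities. Passing ignore_index=True rebuilds a clean 0..n-1 index. A sketch with hypothetical stand-in frames:

```python
import pandas as pd

# Hypothetical stand-ins for the two wrangled frames (placeholder values)
chicago = pd.DataFrame({'city_name': ['Chicago-City'] * 2,
                        'killed_by_crash': [0.0, 1.0]})
newyork = pd.DataFrame({'city_name': ['NewYork-City'] * 2,
                        'killed_by_crash': [0.0, 0.0]},
                       index=[6, 7])  # New York kept its filtered row labels

# ignore_index=True rebuilds a clean 0..n-1 index so row labels do not collide
combined = pd.concat([chicago, newyork], ignore_index=True)
print(list(combined.index))  # [0, 1, 2, 3]
```

Without ignore_index=True, label-based lookups such as .loc[0] could match one row from each city.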

4. Update your data store¶

Here I will update and store the final version of the dataset.

  1. NOTE: The original raw datasets were already stored at the end of Section 1, when the data was loaded. The raw datasets were stored in the folder /datasets. The two data files stored there are traffic-crashes-chicago.csv and dataset_two.csv respectively.

  2. The final cleaned/merged dataset will be stored in /final_cleaned_dataset. The datafile name is chicago_newyork_crash_data.csv

The final data will be converted and saved as a CSV file below

In [76]:
os.makedirs('final_cleaned_dataset/', exist_ok=True)
# index=False avoids writing the dataframe's row index as an extra column
chicago_newyork_crash_data.to_csv('final_cleaned_dataset/chicago_newyork_crash_data.csv', index=False)

print ('Final merged and cleaned dataset has been successfully saved in folder final_cleaned_dataset!')
Final merged and cleaned dataset has been successfully saved in folder final_cleaned_dataset!
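A quick way to validate the store step is to read the file back and compare shapes. The sketch below uses a toy frame and a temporary directory rather than the project's actual file path:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for chicago_newyork_crash_data (placeholder values)
df = pd.DataFrame({'city_name': ['Chicago-City', 'NewYork-City'],
                   'killed_by_crash': [0.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'crash_data.csv')
    df.to_csv(path, index=False)   # index=False avoids a stray index column
    restored = pd.read_csv(path)   # read it back to confirm the round trip

# The restored frame should match the original row and column counts
assert restored.shape == df.shape
```

The same round-trip check on the real file would catch truncated writes or path mistakes early.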

5. Answer the research question¶

5.1: Define and answer the research question¶

Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce at least two visualizations using the cleaned data and explain how they help you answer the question.

Using a sample of the data, I will plot some visualizations to answer the questions.

In [122]:
# Creating a sample segment: the last 2000 rows, i.e. all 711 New York rows
# plus the most recent Chicago rows
sample_final_data = chicago_newyork_crash_data.iloc[-2000:]
print('Sample created!')
Sample created!

Research question 1: Which city has a higher kill count when crashes occur, Chicago or New york city?

In [150]:
#Visual 1 - FILL IN

ax = sb.countplot(data=sample_final_data, y="city_name", hue='killed_by_crash')
plt.title('The city that has the most kill count by Crashes')
Out[150]:
Text(0.5, 1.0, 'The city that has the most kill count by Crashes')

Answer to research question 1:

From the count plot above, it can be seen that Chicago City produces a higher kill count in general when accidents occur. This suggests that the death rate in accidents is greater in Chicago City than in New York City. Chicago City's count of persons killed by crashes exceeded 1200, while New York's was approaching 800, a difference of roughly 400 counts.

Research question 2: Between Chicago and New York, which date or time of the year has recorded the highest fatalities in general across both cities?

In [149]:
sb.lineplot(data=sample_final_data, x="crash_date", y="killed_by_crash")
plt.title('The date/year with the highest Fatalaties between both cities')
plt.xticks(rotation=90)
Out[149]:
(array([19327., 19341., 19358., 19372., 19389., 19403., 19417., 19431.,
        19448., 19462.]),
 [Text(19327.0, 0, '2022-12-01'),
  Text(19341.0, 0, '2022-12-15'),
  Text(19358.0, 0, '2023-01-01'),
  Text(19372.0, 0, '2023-01-15'),
  Text(19389.0, 0, '2023-02-01'),
  Text(19403.0, 0, '2023-02-15'),
  Text(19417.0, 0, '2023-03-01'),
  Text(19431.0, 0, '2023-03-15'),
  Text(19448.0, 0, '2023-04-01'),
  Text(19462.0, 0, '2023-04-15')])

Answer to research question 2:

From the plot above it can be seen that, within the recent year of 2023, the month of March was catastrophic for both cities, with the average of killed_by_crash peaking at roughly 0.25 fatalities per recorded crash. The data was flatter earlier in the year, possibly because it was winter and less driving was going on.

5.2: Reflection¶

In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?

My Reflection:

  • There were more quality and structural issues; for example, the datasets were mismatched in size by a very large proportion. The Chicago dataset had hundreds of thousands of rows while the New York dataset had approximately 1000.
  • To address the issue above, feature scaling could play a part. Feature scaling may have helped bring the data onto the same scale. Despite dropping columns and missing values to negate some skew, feature scaling would greatly help with such a vast dataset.
  • Another thing I would have loved to try is feature engineering. These datasets could perhaps have offered more: some columns were dropped due to missing values, but such columns can still carry weight and provide information. Running a feature-engineering method would have been ideal.
  • I also wanted to further the visualizations using geopy. Both datasets have longitude and latitude coordinates. A future intention was to take basemaps of New York City and Chicago and plot the coordinates on them, so that we can visually see where the crashes occurred in either city!
  • Given the right scaling and the right feature engineering, I could also have run some predictions and trained some machine learning models.
  • That would open up more questions, such as predicting crashes at a certain location in New York or Chicago, or predicting the chance that a crash kills someone versus injuring someone.
  • Finally, this is a project I will love to keep working on in my free time to explore the data further!
In [ ]: